Mach-O Clustering and Yara Signature Generation

Snap into a Yara sig!

Once upon a time we looked at classifying Mach-O and PE files. This time it's flipped on its head: is it possible to use various clustering algorithms to group similar files together? But why stop there?! Can we crank up the awesome and use information from those clusters to generate Yara signatures that find files similar in nature?

In this notebook we'll explore not only gathering static information from Mach-O binaries, but also clustering on those attributes, and finally we'll show off the Yara signature generation capabilities.

Tools

What we did:

  • Gathered data about Mach-O files with macholib (JSON); a minimal sketch of this step follows the list
  • Read in that data
  • Data cleanup
  • Explored the Data!
    • Graphs, clustering
  • Yara signatures
  • More clustering
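
The per-file JSON consumed below was generated ahead of time with macholib; that extraction script isn't part of this notebook, but as a rough idea of the kind of information macholib exposes, here is a minimal sketch (the binary path is just an example):

# Minimal sketch of poking at Mach-O structure with macholib (not the actual extraction script)
from macholib.MachO import MachO

m = MachO('/usr/bin/true')  # any Mach-O binary will do; FAT binaries get one header per architecture
print "architectures:", len(m.headers)
for header in m.headers:
    print "load commands in this architecture:", header.header.ncmds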

In [1]:
# All the imports and some basic level setting with various versions
import IPython
import re
import os
import json
import math  # needed later for the KMeans k rule of thumb
import time
import pylab
import string
import pandas
import pickle
import struct
import socket
import collections
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

print "IPython version: %s" %IPython.__version__
print "pandas version: %s" %pd.__version__
print "numpy version: %s" %np.__version__

%matplotlib inline


IPython version: 2.0.0
pandas version: 0.13.0rc1-32-g81053f9
numpy version: 1.6.1

In [2]:
engines = ['Symantec', 'Sophos', 'F-Prot', 'Kaspersky', 'McAfee', 'Malwarebytes']

def extract_vtdata(data):
    vt = {}
    if 'scans' in data:
        if data['positives'] > 0:
            vt['label'] = 'malicious'
        else:
            vt['label'] = 'nonmalicious'
        vt['positives'] = data['positives']
        for eng in engines:
            if eng in data['scans']:
                if 'result' not in data['scans'][eng] or data['scans'][eng]['result'] is None:
                    vt[eng] = 'no detection'
                else:
                    vt[eng] = data['scans'][eng]['result']
    else:
        vt['label'] = 'no results'
        for eng in engines:
            vt[eng] = 'no results'
        vt['positives'] = 0
    return vt

In [3]:
def load_vt_data(file_list):
    import json
    features_list = {}
    for filename in file_list:
        with open(filename,'rb') as f:
            features = extract_vtdata(json.loads(f.read()))
            fname = os.path.split(filename)[1].split('.')[0]
            features_list[fname] = features
    return features_list

import glob
file_list = glob.glob('/Users/user/vt_data/*.vtdata')
vt_data = load_vt_data(file_list)

In [4]:
# This simply loads up the JSON and flattens it. FAT binaries are broken down into a feature vector for each architecture
def extract_features(filename, data):
    all_features = []
    if not 'error' in data['characteristics']['macho']:
        for i in range(data['characteristics']['macho']['number of architectures']):
            features = {}
            #features['magic'] = int(data['characteristics']['macho']['header'][i]['magic'], 0)
            #features['h_size'] = data['characteristics']['macho']['header'][i]['size']
            #features['h_offset'] = data['characteristics']['macho']['header'][i]['offset']
            for command in data['verbose']['macho']['header'][i]['commands']:
                if command['cmd_name'] in ['LC_SEGMENT', 'LC_SEGMENT_64']:
                    bits = ''
                    if command['cmd_name'] == 'LC_SEGMENT_64':
                        bits = "64"
                    if command['segname'] == '__PAGEZERO':
                        features['lc_segment_' + bits + '_vmaddr'] = command['vmaddr']
                        features['lc_segment_' + bits + '_vmsize'] = command['vmsize']
                        features['lc_segment_' + bits + '_filesize'] = command['filesize']
                        features['lc_segment_' + bits + '_fileoff'] = command['fileoff']
                if command['cmd_name'] == 'LC_VERSION_MIN_MACOSX':
                    features['lc_version_min_macosx_min_version'] = float('.'.join(command['version'].split('.')[:2]))
                if command['cmd_name'] == 'LC_SYMTAB':
                    features['lc_symtab_strsize'] = command['strsize']
                    features['lc_symtab_stroff'] = command['stroff']
                    features['lc_symtab_symoff'] = command['symoff']
                    features['lc_symtab_nsyms'] = command['nsyms']
                if command['cmd_name'] in ['LC_DYLD_INFO_ONLY', 'LC_DYLD_INFO']:
                    features['lc_dyld_info_lazy_bind_size'] = command['lazy_bind_size']
                    features['lc_dyld_info_rebase_size'] = command['rebase_size']
                    features['lc_dyld_info_lazy_bind_off'] = command['lazy_bind_off']
                    features['lc_dyld_info_export_off'] = command['export_off']
                    features['lc_dyld_info_export_size'] = command['export_size']
                    features['lc_dyld_info_bind_off'] = command['bind_off']
                    features['lc_dyld_info_rebase_off'] = command['rebase_off']
                    features['lc_dyld_info_bind_size'] = command['bind_size']
                    features['lc_dyld_info_weak_bind_size'] = command['weak_bind_size']
                    features['lc_dyld_info_weak_bind_off'] = command['weak_bind_off']
                if command['cmd_name'] == 'LC_DYSYMTAB':
                    features['lc_dysymtab_nextdefsym'] = command['nextdefsym']
                    features['lc_dysymtab_extreloff'] = command['extreloff']
                    features['lc_dysymtab_nlocrel'] = command['nlocrel']
                    features['lc_dysymtab_modtaboff'] = command['modtaboff']
                    features['lc_dysymtab_iundefsym'] = command['iundefsym']
                    features['lc_dysymtab_ntoc'] = command['ntoc']
                    features['lc_dysymtab_ilocalsym'] = command['ilocalsym']
                    features['lc_dysymtab_nundefsym'] = command['nundefsym']
                    features['lc_dysymtab_nextrefsyms'] = command['nextrefsyms']
                    features['lc_dysymtab_locreloff'] = command['locreloff']
                    features['lc_dysymtab_nmodtab'] = command['nmodtab']
                    features['lc_dysymtab_nlocalsym'] = command['nlocalsym']
                    features['lc_dysymtab_tocoff'] = command['tocoff']
                    features['lc_dysymtab_extrefsymoff'] = command['extrefsymoff']
                    features['lc_dysymtab_nindirectsyms'] = command['nindirectsyms']
                    features['lc_dysymtab_iextdefsym'] = command['iextdefsym']
                    features['lc_dysymtab_nextrel'] = command['nextrel']
                    features['lc_dysymtab_indirectsymoff'] = command['indirectsymoff']

            features.update(data['verbose']['macho']['header'][i]['command type count'])
            if 'LC_SEGMENT' in features:
                features['number of segments'] = features['LC_SEGMENT']
            else:
                features['number of segments'] = features['LC_SEGMENT_64']
            features['filename'] = filename[2:-8]
            
            # Remove some more features
            for lc in ['LC_MAIN', 'LC_UNIXTHREAD']:
                if lc in features: features.pop(lc, None)
            
            # Use a separate variable so the original filename isn't clobbered for FAT binaries
            fname = os.path.split(filename)[1].split('.')[0]
            for eng in engines:
                if fname in vt_data:
                    if eng in vt_data[fname]:
                        features[eng] = vt_data[fname][eng]
                    else:
                        features[eng] = 'no result'
                    features['label'] = vt_data[fname]['label']
                    features['positives'] = vt_data[fname]['positives']
                else:
                    features[eng] = 'no result'
                    features['label'] = 'no result'
            all_features.append(features)
            
    return all_features

In [5]:
def load_files(file_list):
    import json
    features_list = []
    for filename in file_list:
        with open(filename,'rb') as f:
            features = extract_features(filename, json.loads(f.read()))
            features_list.extend(features)
    return features_list

import glob
file_list = glob.glob('./*.results')
features = load_files(file_list)
print "Files:", len(file_list)
print "Number of feature vectors:", len(features)


Files: 527
Number of feature vectors: 639

Now that all of the files are loaded, once again, there are more feature vectors than files. As mentioned above, this is because FAT binaries are broken down into one vector per contained architecture.

Some examples of the features that are used are:

  • Load Commands (LC) present in the file (e.g. is LC_FUNCTION_STARTS in the binary)
  • Specific values in various LC structs (e.g. The major and minor version of OS X for the binary)
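
A quick way to sanity-check one of the flattened records (using the features list built above):

# Peek at a handful of keys from the first flattened feature vector
for key in sorted(features[0].keys())[:10]:
    print "%s: %s" % (key, features[0][key])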

In [6]:
df = pd.DataFrame.from_records(features)
for col in df.columns:
    if col.startswith('LC_'):
        df[col].fillna(0, inplace=True)
        
df.fillna(-1, inplace=True)
print df.shape
df.head()


(639, 75)
Out[6]:
F-Prot Kaspersky LC_CODE_SEGMENT_SPLIT_INFO LC_CODE_SIGNATURE LC_DATA_IN_CODE LC_DYLD_INFO LC_DYLD_INFO_ONLY LC_DYLIB_CODE_SIGN_DRS LC_DYSYMTAB LC_ENCRYPTION_INFO LC_FUNCTION_STARTS LC_ID_DYLIB LC_LAZY_LOAD_DYLIB LC_LOAD_DYLIB LC_LOAD_DYLINKER LC_LOAD_WEAK_DYLIB LC_REEXPORT_DYLIB LC_RPATH LC_SEGMENT LC_SEGMENT_64
0 no detection no detection 0 1 1 0 1 1 1 0 1 0 0 1 1 0 0 0 0 4 ...
1 no detection no detection 0 1 1 0 1 1 1 0 1 0 0 9 1 0 0 0 0 4 ...
2 no detection no detection 0 1 1 0 1 1 1 0 1 0 0 9 1 0 0 0 4 0 ...
3 no detection no detection 0 1 1 0 1 1 1 0 1 0 0 5 1 0 0 0 0 4 ...
4 no detection no detection 0 1 1 0 1 1 1 0 1 0 0 1 1 0 0 0 0 4 ...

5 rows × 75 columns


In [7]:
# Brief overview of the various things detected by Symantec in this dataset; really we just want to verify that we have some data
df[engines].Symantec.value_counts().head(10)


Out[7]:
no detection       412
OSX.Flashback.K     47
Trojan.Gen.2        36
Backdoor.Trojan     21
OSX.Flashback       15
OSX.Crisis          14
Trojan Horse         8
OSX.MacControl       7
OSX.Kitmos           6
Downloader           5
dtype: int64

In [8]:
ignore_cols = engines + ['filename', 'label', 'positives']
cols = [x for x in df.columns.tolist() if not x in ignore_cols]

Time to visualize the raw data! Since it's hard for us humans to visualize things with over 60 dimensions, we can use PCA (Principal Component Analysis) to project (reduce the dimensionality of) all those features down to a few that we can graph. In this case, we'll look at 2D and 3D images that represent the data. It's interesting to see how much information is lost between 3D and 2D; imagine what this could look like if we could see all the dimensions!

Here we scale the values to set all variables on equal footing, and because it helps PCA work properly (ok, mostly because it helps PCA work properly).
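
As a quick sketch of what that scaling does (each column comes out with roughly zero mean and, for non-constant columns, unit variance):

# Sanity check on sklearn's scale() using the same columns we cluster on
from sklearn.preprocessing import scale
Xs = scale(df.as_matrix(cols))
print "column means (should be ~0):", Xs.mean(axis=0)[:3]
print "column stds (should be ~1 for non-constant columns):", Xs.std(axis=0)[:3]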


In [9]:
X = df.as_matrix(cols)
from sklearn.preprocessing import scale
X = scale(X)

from sklearn.decomposition import PCA
DDD = PCA(n_components=3).fit_transform(X)
DD = PCA(n_components=2).fit_transform(X)

In [10]:
from mpl_toolkits.mplot3d import Axes3D

figsize(12,8)
fig = plt.figure(figsize=plt.figaspect(.5))
ax = fig.add_subplot(1, 2, 1, projection='3d')
ax.scatter(DDD[:,0], DDD[:,1], DDD[:,2], s=50)
ax.set_title("Raw Data 3D")
ax = fig.add_subplot(1, 2, 2)
ax.scatter(DD[:,0], DD[:,1], s=50)
ax.set_title("Raw Data 2D)")
plt.show()


Let's get into the meat of this. Below is DBSCAN; it enjoys long walks on the beach, non-flat geometry, and uneven cluster sizes (http://scikit-learn.org/stable/modules/clustering.html). This seemed like a good selection for several reasons. We expect uneven cluster sizes, since this sample of files contains both malware and legitimate Apple-compiled binaries. Because the features are built from the file structure, clustering should pick out the several different tool chains (compilers, etc.) used, and it would be surprising to see an even distribution of that kind of information across the data set. Another nice feature of the scikit-learn implementation is that all samples that don't belong to a cluster are labeled with "-1". This avoids shoving files into clusters and reducing the efficiency of any generated Yara signature. However, if we're searching for more generic sigs we can play games to get more samples into clusters, or use different algorithms.

We also show the difference between non-scaled and non-reduced data, and how you can get different (and usually better) results by scaling and reducing.
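
As a sketch of the kind of games we can play (assuming the unscaled feature matrix X built in the next cell), DBSCAN's eps parameter directly controls how many samples end up in the -1 noise bucket:

# Loosening eps pulls more samples out of the unlabeled (-1) bucket, at the cost of looser clusters
from sklearn.cluster import DBSCAN
for eps in [0.3, 0.5, 1.0, 5.0]:
    labels = DBSCAN(eps=eps, min_samples=3).fit(X).labels_
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print "eps=%s  clusters=%s  unlabeled=%s" % (eps, n_clusters, list(labels).count(-1))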


In [11]:
from sklearn.cluster import DBSCAN

X = df.as_matrix(cols)

dbscan = DBSCAN(min_samples=3)
dbscan.fit(X)
labels1 = dbscan.labels_
labels1_u = np.unique(labels1)
nclusters = len(labels1_u)

dbscan_df = pd.DataFrame()
dbscan_df = df[['label', 'filename', 'positives'] + engines]
dbscan_df['cluster'] = labels1

print "Number of clusters: %d" % nclusters
print "Labeled samples: %s" % dbscan_df[dbscan_df['cluster'] != -1].filename.value_counts().sum()
print "Unlabeled samples: %s" % dbscan_df[dbscan_df['cluster'] == -1].filename.value_counts().sum()


Number of clusters: 21
Labeled samples: 140
Unlabeled samples: 499

There is a fair number of unlabeled samples, which is not terribly surprising. However, it should still be interesting to see how the clustering algorithm did, and what got jammed where.

A simple overview of the breakdown of malicious vs. non-malicious. Remember from above: a sample was marked malicious if it had at least one detection from an AV engine on VirusTotal.


In [12]:
dbscan_df.groupby(['cluster', 'label']).count()[['filename']].head(10)


Out[12]:
filename
cluster label
-1 malicious 208
nonmalicious 291
0 nonmalicious 19
1 malicious 3
2 nonmalicious 12
3 malicious 8
4 malicious 5
5 malicious 4
6 malicious 4
7 nonmalicious 12

10 rows × 1 columns

How many clusters had both malicious and non-malicious samples in them?

Zero. Cool!

Remember DBSCAN will assign a label of '-1' to all non-clustered samples.


In [13]:
clusters = set()
for name, blah in dbscan_df.groupby(['cluster', 'label'])['label']:
    if name[0] in clusters:
        print "%s Cluster has both Malicious and Non-Malicious Samples" % name[0]
    clusters.add(name[0])


-1.0 Cluster has both Malicious and Non-Malicious Samples

In a spot check, this cluster at least looks pretty good! It looks like similar things were effectively grouped together. #Science


In [15]:
sample_cluster = 3
dbscan_df[dbscan_df['cluster'] == sample_cluster][engines]


Out[15]:
Symantec Sophos F-Prot Kaspersky McAfee Malwarebytes
147 OSX.Flashback.K OSX/Flshplyr-D MacOS/FlashBack.A Trojan-Downloader.OSX.Flashfake.ab OSX/Flashfake.c no detection
198 OSX.Flashback.K OSX/Flshplyr-D MacOS/FlashBack.A Trojan-Downloader.OSX.Flashfake.ab OSX/Flashfake.c no detection
246 OSX.Flashback.K OSX/Flshplyr-D MacOS/FlashBack.A Trojan-Downloader.OSX.Flashfake.ak OSX/Flashfake.c no detection
268 OSX.Flashback.K OSX/Flshplyr-D MacOS/FlashBack.A Trojan-Downloader.OSX.Flashfake.ab OSX/Flashfake.c no detection
282 OSX.Flashback.K OSX/Flshplyr-D MacOS/FlashBack.A Trojan-Downloader.OSX.Flashfake.ab OSX/Flashfake.c no detection
318 OSX.Flashback.K OSX/Flshplyr-D MacOS/FlashBack.A Trojan-Downloader.OSX.Flashfake.ab OSX/Flashfake.c no detection
344 OSX.Flashback.K OSX/Flshplyr-D MacOS/FlashBack.A Trojan-Downloader.OSX.Flashfake.ab OSX/Flashfake.c no detection
413 Trojan.Gen.2 OSX/Flshplyr-D MacOS/FlashBack.A Trojan-Downloader.OSX.Flashfake.ab OSX/Flashfake.c no detection

8 rows × 6 columns


In [16]:
for eng in engines:
    print "%s - %s" %(eng, len(dbscan_df[dbscan_df['cluster'] == sample_cluster][eng].unique().tolist()))


Symantec - 2
Sophos - 1
F-Prot - 1
Kaspersky - 2
McAfee - 1
Malwarebytes - 1
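
To run the same distinct-label check across every cluster at once (rather than spot checking a single one), a small sketch:

# Count the distinct Symantec labels in each DBSCAN cluster (ignoring the -1 noise bucket)
for c in sorted(dbscan_df[dbscan_df['cluster'] != -1]['cluster'].unique()):
    n_labels = dbscan_df[dbscan_df['cluster'] == c]['Symantec'].nunique()
    print "cluster %s: %s distinct Symantec labels" % (c, n_labels)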

In [17]:
# This is a ballpark to see what might be a good number of components to reduce our original 66 features to
X = df.as_matrix(cols)
X = scale(X)
pca = PCA().fit(X)
n_comp = len([x for x in pca.explained_variance_ if x > 1e0])
print "Number of components w/explained variance > 1: %s" % n_comp


Number of components w/explained variance > 1: 18
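
Another common way to ballpark this, sketched with the pca object fit above, is to keep enough components to cover a fixed share of the total variance:

# Keep the smallest number of components whose cumulative explained variance reaches 90%
evr = pca.explained_variance_ratio_
print "Components for 90%% of the variance: %s" % (np.searchsorted(np.cumsum(evr), 0.90) + 1)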

In [18]:
X = df.as_matrix(cols)
X = scale(X)
X = PCA(n_components=n_comp).fit_transform(X)

dbscan = DBSCAN(min_samples=3)
dbscan.fit(X)
labels1 = dbscan.labels_
labels1_u = np.unique(labels1)
nclusters = len(labels1_u)

dbscan_df = pd.DataFrame()
dbscan_df = df[['label', 'filename', 'positives'] + engines]
dbscan_df['cluster'] = labels1

print "Number of clusters: %d" % nclusters
print "Labeled samples: %s" % dbscan_df[dbscan_df['cluster'] != -1].filename.value_counts().sum()
print "Unlabeled samples: %s" % dbscan_df[dbscan_df['cluster'] == -1].filename.value_counts().sum()


Number of clusters: 34
Labeled samples: 452
Unlabeled samples: 187

With just a few lines of code we could reduce the number of unlabeled (-1) samples and increase the number of labeled samples! These are just some of the ways we can influence the behavior of these algorithms.

You can see below some of the clusters (the number on the left) and how many members there are in each cluster (the number on the right). This is on the scaled and PCA'd data, and these numbers would look quite different had we run it on the raw data above.

Last but not least, there are a few graphs showing our projections from above, but colored by cluster label. Remember, for the PCA'd data above the graphs are 2D and 3D representations of 18D data, so colors that look pretty close here are probably a bit less close in 18D or even 60D.


In [19]:
# Number of samples per cluster (cluster -1 holds the unlabeled samples)
dbscan_df.cluster.value_counts().head(10)


Out[19]:
-1     187
 3     131
 9      30
 13     26
 4      21
 19     21
 22     20
 16     18
 0      17
 18     15
dtype: int64

In [20]:
# Remove unlabeled samples for graphing to make it prettier
df['cluster'] = dbscan_df['cluster']
tempdf = df[df['cluster'] != -1].reset_index(drop=True)
X = tempdf.as_matrix(cols)
X = scale(X)
DDD = PCA(n_components=3).fit_transform(X)
DD = PCA(n_components=2).fit_transform(X)

figsize(12,12)
fig = plt.figure(figsize=plt.figaspect(.5))
ax = fig.add_subplot(2, 2, 1, projection='3d')
ax.scatter(DDD[:,0], DDD[:,1], DDD[:,2], c=tempdf['cluster'], s=50)
ax.set_title("DBSCAN Clusters")
ax = fig.add_subplot(2, 2, 2, projection='3d')
ax.set_xlim(-10,5)
ax.set_ylim(-10,15)
ax.set_zlim(-30,5)
ax.scatter(DDD[:,0], DDD[:,1], DDD[:,2], c=tempdf['cluster'], s=50)
ax.set_title("DBSCAN Clusters (zoomed in)")
ax = fig.add_subplot(2, 2, 3)
ax.scatter(DD[:,0], DD[:,1], c=tempdf['cluster'], s=50)
ax.set_title("DBSCAN Clusters")
ax = fig.add_subplot(2, 2, 4)
ax.set_xlim(-6,5)
ax.set_ylim(-15,10)
ax.scatter(DD[:,0], DD[:,1], c=tempdf['cluster'], s=50)
ax.set_title("DBSCAN Clusters (zoomed in)")
plt.show()
#df.drop('cluster', axis=1, inplace=True)



In [21]:
dbscan_df.groupby(['cluster', 'label']).count()[['filename']].head(10)


Out[21]:
filename
cluster label
-1 malicious 85
nonmalicious 102
0 malicious 17
1 malicious 9
2 malicious 6
3 nonmalicious 131
4 nonmalicious 21
5 nonmalicious 11
6 malicious 8
7 malicious 7

10 rows × 1 columns


In [22]:
clusters = set()
print "Total Number of Clusters: %s\n" % (len(dbscan_df['cluster'].unique().tolist()))
for name, blah in dbscan_df.groupby(['cluster', 'label'])['label']:
    if name[0] in clusters:
        print "%s Cluster has both Malicious and Non-Malicious Samples" % name[0]
    clusters.add(name[0])


Total Number of Clusters: 34

-1.0 Cluster has both Malicious and Non-Malicious Samples
15.0 Cluster has both Malicious and Non-Malicious Samples
18.0 Cluster has both Malicious and Non-Malicious Samples
19.0 Cluster has both Malicious and Non-Malicious Samples
20.0 Cluster has both Malicious and Non-Malicious Samples
22.0 Cluster has both Malicious and Non-Malicious Samples

In [23]:
sample_cluster = 0
dbscan_df[dbscan_df['cluster'] == sample_cluster][engines]


Out[23]:
Symantec Sophos F-Prot Kaspersky McAfee Malwarebytes
20 no detection OSX/MSDrop-A no detection Backdoor.OSX.Getshell.k no detection no detection
72 no detection OSX/Getshell-A no detection Backdoor.OSX.Getshell.k no detection no detection
77 OSX.Olyx OSX/Bckdr-RID no detection Backdoor.OSX.Lasyr.b no detection no detection
120 no detection OSX/Getshell-A no detection Backdoor.OSX.Getshell.k no detection no detection
130 OSX.Olyx.B OSX/Bckdr-RID no detection Backdoor.OSX.Lasyr.b no detection no detection
142 OSX.Olyx.B OSX/Bckdr-RID no detection Backdoor.OSX.Lasyr.b no detection no detection
180 no detection OSX/Getshell-A no detection Backdoor.OSX.Getshell.k no detection no detection
184 OSX.GetShell OSX/Bckdr-RMB no detection Backdoor.OSX.Getshell.k no detection no detection
201 OSX.GetShell OSX/Bckdr-RMB no detection Backdoor.OSX.Getshell.k no detection no detection
217 Trojan Horse OSX/Getshell-A no detection Backdoor.OSX.Getshell.k no detection no detection
219 Trojan.Gen.2 OSX/Bckdr-RMB no detection Backdoor.OSX.Getshell.k no detection no detection
253 no detection OSX/Bckdr-RMB no detection Backdoor.OSX.Getshell.k no detection no detection
310 OSX.Olyx.B OSX/Bckdr-RLI no detection Backdoor.OSX.Lasyr.e no detection no detection
325 OSX.Olyx.B OSX/Bckdr-RID no detection Backdoor.OSX.Lasyr.b no detection no detection
387 OSX.Olyx.B OSX/Bckdr-RID no detection Backdoor.OSX.Lasyr.b no detection no detection
395 OSX.GetShell OSX/Bckdr-RMB no detection Backdoor.OSX.Getshell.k no detection no detection
530 no detection OSX/Getshell-A no detection Backdoor.OSX.Getshell.k no detection no detection

17 rows × 6 columns


In [24]:
for eng in engines:
    print "%s - %s" %(eng, len(dbscan_df[dbscan_df['cluster'] == sample_cluster][eng].unique().tolist()))


Symantec - 6
Sophos - 5
F-Prot - 1
Kaspersky - 3
McAfee - 1
Malwarebytes - 1

Perfect, we've got our files clustered into groups that we think should be similar/close to one another. This is great if we wanted to stop here, but how many times can you run a Python model on machines across an enterprise or on an appliance in order to find similar files? I'm willing to bet not very often. Instead we need to get this information out of Python and into something usable against files. For this, we've got Yara!

Below you'll see a simple call-out to a yara_signature Python module. This module contains code to generate a signature based on attributes found in the file. We've chosen a cluster (3) and a file from that cluster to base the signature off of. Then the attributes that are constant and non-zero (present) across the cluster are added to the signature. Some of the struct values can be influenced in the sig, and that's the reason for the multiple lists used to keep track of the various attributes.


In [25]:
import yara_signature

name = 3
fdf = pd.DataFrame()
for f in dbscan_df[dbscan_df['cluster'] == name].filename.tolist():
    fdf = fdf.append(df[df['filename'] == f], ignore_index=True)
    
# Choose a file from the cluster to use as the basis of the sig w/the attributes below
filename = fdf.filename.value_counts().index[0]

meta = {"author" : "sconzo", "email" : "sconzo_at_clicksecurity_dot_com"}

sig = yara_signature.yara_macho_generator.YaraMachoGenerator("/Users/sconzo/macho-yara/macho/" +filename, samplename="Cluster_"+str(name), meta=meta)
    
lc_cmds = []
lc_symtab = []
lc_dysymtab = []
lc_dyld_info = []
lc_segment = []
lc_segment_64 = []

for col in fdf.columns:
    if len(fdf[col].unique()) == 1:
        if fdf[col].unique()[0] != 0:
            lower = [s for s in col if s.islower()]
            if fdf[col].unique()[0] > 0 or (len(lower) == len(col)):
                if col.startswith('LC_'):
                    lc_cmds.append(col)
                if col.startswith('lc_segment_64_'):
                    lc_segment_64.append(col)
                elif col.startswith('lc_segment_'):
                    lc_segment.append(col)
                if col.startswith('lc_symtab_'):
                    lc_symtab.append(col)
                if col.startswith('lc_dysymtab_'):
                    lc_dysymtab.append(col)
                if col.startswith('lc_dyld_info_'):
                    lc_dyld_info.append(col)

if len(lc_symtab) > 0:
    lc_cmds = [x for x in lc_cmds if x != 'LC_SYMTAB']
    lc_symtab = set([x[10:] for x in lc_symtab])
    sig.add_symtab(lc_symtab)

if len(lc_dysymtab) > 0:
    lc_cmds = [x for x in lc_cmds if x != 'LC_DYSYMTAB']
    lc_dysymtab = set([x[12:] for x in lc_dysymtab])
    sig.add_dysymtab(lc_dysymtab)

if len(lc_dyld_info):
    lc_cmds = [x for x in lc_cmds if x != 'LC_DYLD_INFO']
    lc_cmds = [x for x in lc_cmds if x != 'LC_DYLD_INFO_ONLY']
    lc_dyld_info = set([x[13:] for x in lc_dyld_info])
    sig.add_dyld_info(lc_dyld_info)

if len(lc_segment) > 0:
    lc_cmds = [x for x in lc_cmds if x != 'LC_SEGMENT']
    lc_segment = set([x[12:] for x in lc_segment])
    sig.add_segment(lc_segment)

if len(lc_segment_64) > 0:
    lc_cmds = [x for x in lc_cmds if x != 'LC_SEGMENT_64']
    lc_segment_64 = set([x[14:] for x in lc_segment_64])
    sig.add_segment(lc_segment_64)

if 'LC_VERSION_MIN_IPHONEOS' in lc_cmds:
    lc_cmds = [x for x in lc_cmds if x != 'LC_VERSION_MIN_IPHONEOS']
    sig.add_version_min_macosx()

if 'LC_VERSION_MIN_MACOSX' in lc_cmds:
    lc_cmds = [x for x in lc_cmds if x != 'LC_VERSION_MIN_MACOSX']
    sig.add_version_min_macosx()
[sig.add_lc(x) for x in lc_cmds]

print sig.get_signature()


rule Cluster_3
{
meta:
    author = "sconzo"
    email = "sconzo_at_clicksecurity_dot_com"
    generator = "This sweet yara sig generator!"

strings:
    $LC_VERSION_MIN_MACOSX = { 24 00 00 00 10 00 00 00 ?? 09 0a 00 00 }
    $LC_CODE_SIGNATURE_0 = { 1d 00 00 00 ?? 00 00 00 }
    $LC_DATA_IN_CODE_1 = { 29 00 00 00 ?? 00 00 00 }
    $LC_DYLD_INFO_ONLY_2 = { 22 00 00 80 ?? 00 00 00 }
    $LC_DYLIB_CODE_SIGN_DRS_3 = { 2b 00 00 00 ?? 00 00 00 }
    $LC_DYSYMTAB_4 = { 0b 00 00 00 ?? 00 00 00 }
    $LC_FUNCTION_STARTS_5 = { 26 00 00 00 ?? 00 00 00 }
    $LC_LOAD_DYLINKER_6 = { 0e 00 00 00 ?? 00 00 00 }
    $LC_SOURCE_VERSION_7 = { 2a 00 00 00 ?? 00 00 00 }
    $LC_SYMTAB_8 = { 02 00 00 00 ?? 00 00 00 }
    $LC_UUID_9 = { 1b 00 00 00 ?? 00 00 00 }
condition:
    $LC_VERSION_MIN_MACOSX and
    $LC_CODE_SIGNATURE_0 and
    $LC_DATA_IN_CODE_1 and
    $LC_DYLD_INFO_ONLY_2 and
    $LC_DYLIB_CODE_SIGN_DRS_3 and
    $LC_DYSYMTAB_4 and
    $LC_FUNCTION_STARTS_5 and
    $LC_LOAD_DYLINKER_6 and
    $LC_SOURCE_VERSION_7 and
    $LC_SYMTAB_8 and
    $LC_UUID_9
}
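
If the yara-python bindings happen to be installed, a sketch of how the generated rule could be compiled and run against the same directory of samples used above:

# Compile the generated rule text and scan a directory with it (yara-python assumed)
import os
import yara

rules = yara.compile(source=sig.get_signature())
sample_dir = '/Users/sconzo/macho-yara/macho'
for fname in os.listdir(sample_dir):
    if rules.match(os.path.join(sample_dir, fname)):
        print "matched:", fname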

Since we've got one method of clustering to Yara signature down, let's take a brief look at what happens to the cluster shapes/distributions with some other clustering algorithms.

First up, KMeans. It will put every sample into a cluster, and for this algorithm you must specify the number of clusters (the 'K' in KMeans). There are a bunch of ways to determine how many clusters to use; below we went with a simple rule of thumb from Wikipedia (http://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set).
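
Another option (a sketch, assuming the scaled matrix X built in the next cell) is to scan a range of k values and keep the one with the best silhouette score:

# Scan candidate values of k and report the silhouette score for each
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
for k in range(2, 21):
    labels = KMeans(n_clusters=k).fit(X).labels_
    print "k=%2d  silhouette=%.3f" % (k, silhouette_score(X, labels))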


In [26]:
from sklearn.cluster import KMeans

X = df.as_matrix(cols)
X = scale(X)

#rule of thumb of k = sqrt(#samples/2), thanks wikipedia :)
k_clusters = int(math.sqrt(int(len(X)/2)))
kmeans = KMeans(n_clusters=k_clusters)
kmeans.fit(X)
labels1 = kmeans.labels_
df['cluster'] = labels1
labels1_u = np.unique(labels1)
nclusters = len(labels1_u)

kmeans_df = pd.DataFrame()
kmeans_df = df[['label', 'filename', 'positives'] + engines]
kmeans_df['cluster'] = labels1

print "Number of clusters: %d" % k_clusters


Number of clusters: 17

In [27]:
kmeans_df['cluster'].value_counts().head(10)


Out[27]:
3     161
5     144
1      97
10     81
15     59
11     40
0      26
7      15
6       6
14      2
dtype: int64

In [28]:
X = df.as_matrix(cols)
X = scale(X)
X = PCA(n_components=3).fit_transform(X)

figsize(12,8)
fig = plt.figure(figsize=plt.figaspect(.5))
ax = fig.add_subplot(1, 2, 1, projection='3d')
ax.scatter(X[:,0], X[:,1], X[:,2], c=kmeans_df['cluster'], s=50)
ax.set_title("Kmeans Clusters")
ax = fig.add_subplot(1, 2, 2, projection='3d')
ax.set_xlim(-10,2)
ax.set_ylim(10,35)
ax.set_zlim(-20,10)
ax.scatter(X[:,0], X[:,1], X[:,2], c=kmeans_df['cluster'], s=50)
ax.set_title("KMeans Clusters (zoomed in)")
plt.show()



In [29]:
clusters = set()
for name, blah in kmeans_df.groupby(['cluster', 'label'])['label']:
    if name[0] in clusters:
        print "%s Cluster has both Malicious and Non-Malicious Samples" % name[0]
    clusters.add(name[0])


0 Cluster has both Malicious and Non-Malicious Samples
1 Cluster has both Malicious and Non-Malicious Samples
3 Cluster has both Malicious and Non-Malicious Samples
5 Cluster has both Malicious and Non-Malicious Samples
10 Cluster has both Malicious and Non-Malicious Samples
11 Cluster has both Malicious and Non-Malicious Samples
15 Cluster has both Malicious and Non-Malicious Samples

In [30]:
sample_cluster = 5
kmeans_df[kmeans_df['cluster'] == sample_cluster][engines].head()


Out[30]:
Symantec Sophos F-Prot Kaspersky McAfee Malwarebytes
19 OSX.Dockster.A OSX/Agent-AADL no detection Exploit.OSX.CVE-2009-0563.a no detection no detection
20 no detection OSX/MSDrop-A no detection Backdoor.OSX.Getshell.k no detection no detection
23 no detection OSX/Spynion-A no detection Trojan.OSX.Spynion.a OSX/OpinionSpy no detection
27 Downloader OSX/FakeAv-FFN no detection Trojan-Downloader.OSX.FavDonw.c no detection no detection
28 OSX.Flashback.K OSX/Flshplyr-D MacOS/FlashBack.B Trojan-Downloader.OSX.Flashfake.ab OSX/Flashfake.c no detection

5 rows × 6 columns


In [31]:
for eng in engines:
    print "%s - %s" %(eng, len(kmeans_df[kmeans_df['cluster'] == sample_cluster][eng].unique().tolist()))


Symantec - 22
Sophos - 42
F-Prot - 6
Kaspersky - 59
McAfee - 19
Malwarebytes - 1

In [32]:
kmeans_df[kmeans_df['cluster'] == sample_cluster]['Symantec'].value_counts()


Out[32]:
no detection           50
OSX.Flashback.K        23
Trojan.Gen.2           18
Backdoor.Trojan        10
OSX.MacControl          7
OSX.Olyx.B              5
Macsweeper              4
Trojan Horse            3
OSX.Dockster.A          3
OSX.GetShell            3
Downloader              3
OSX.Olyx                2
Trojan.Gen              2
OSX.Coinbitminer        2
OSX.Sabpab              2
OSX.Imauler             1
Spyware.SniperSpy.B     1
Yontoo.B                1
OSX.Imuler              1
MACDefender             1
OSX.Hackback            1
OSX.Revir               1
dtype: int64

In [33]:
X = df.as_matrix(cols)
X = scale(X)
X = PCA(n_components=n_comp).fit_transform(X)

#rule of thumb of k = sqrt(#samples/2), thanks wikipedia :)
k_clusters = int(math.sqrt(int(len(X)/2)))

kmeans = KMeans(n_clusters=k_clusters)
kmeans.fit(X)
labels1 = kmeans.labels_
df['cluster'] = labels1
labels1_u = np.unique(labels1)
nclusters = len(labels1_u)

kmeans_df = pd.DataFrame()
kmeans_df = df[['label', 'filename', 'positives'] + engines]
kmeans_df['cluster'] = labels1

print "Number of clusters: %d" % nclusters
print
print "Cluster/Sample Layout"
print df.cluster.value_counts().head(10)
print

X = df.as_matrix(cols)
X = scale(X)
X = PCA(n_components=3).fit_transform(X)

figsize(12,8)
fig = plt.figure(figsize=plt.figaspect(.5))
ax = fig.add_subplot(1, 2, 1, projection='3d')
ax.scatter(X[:,0], X[:,1], X[:,2], c=kmeans_df['cluster'], s=50)
ax.set_title("KMeans Clusters")
ax = fig.add_subplot(1, 2, 2, projection='3d')
ax.set_xlim(-10,2)
ax.set_ylim(15,30)
ax.set_zlim(-20,0)
ax.scatter(X[:,0], X[:,1], X[:,2], c=kmeans_df['cluster'], s=50)
ax.set_title("KMeans Clusters (zoomed in)")
plt.show()


Number of clusters: 17

Cluster/Sample Layout
0     154
15    151
9      97
1      81
16     53
7      40
4      28
14     17
11      6
12      4
dtype: int64

Above you can see how scaling and PCA lead to a bit more balanced layout of some of the clusters, but we've still got some outliers. Not a huge deal, just another way to slice and look at the data.


In [34]:
clusters = set()
for name, blah in kmeans_df.groupby(['cluster', 'label'])['label']:
    if name[0] in clusters:
        print "%s Cluster has both Malicious and Non-Malicious Samples" % name[0]
    clusters.add(name[0])


0 Cluster has both Malicious and Non-Malicious Samples
1 Cluster has both Malicious and Non-Malicious Samples
4 Cluster has both Malicious and Non-Malicious Samples
7 Cluster has both Malicious and Non-Malicious Samples
9 Cluster has both Malicious and Non-Malicious Samples
15 Cluster has both Malicious and Non-Malicious Samples
16 Cluster has both Malicious and Non-Malicious Samples

In [38]:
sample_cluster = 4
kmeans_df[kmeans_df['cluster'] == sample_cluster][engines].head()


Out[38]:
Symantec Sophos F-Prot Kaspersky McAfee Malwarebytes
29 no detection no detection no detection no detection no detection no detection
57 no detection OSX/MusMinim-A no detection Backdoor.OSX.BlackHol.g no detection no detection
65 Backdoor.Trojan OSX/MusMinim-C no detection Backdoor.OSX.BlackHol.b no detection no detection
118 no detection no detection no detection no detection no detection no detection
119 no detection no detection no detection no detection no detection no detection

5 rows × 6 columns


In [39]:
for eng in engines:
    print "%s - %s" %(eng, len(kmeans_df[kmeans_df['cluster'] == sample_cluster][eng].unique().tolist()))


Symantec - 8
Sophos - 14
F-Prot - 2
Kaspersky - 12
McAfee - 4
Malwarebytes - 1

Below we're looking at MeanShift. scikit-learn is nice enough to tell us a bit about MeanShift use cases (many clusters, uneven cluster size, non-flat geometry). This seems to, once again, fit our data pretty well. Maybe we can get some better/different layouts of clusters here.


In [40]:
from sklearn.cluster import MeanShift, estimate_bandwidth

X = df.as_matrix(cols)
X = scale(X)

ebw = estimate_bandwidth(X)
ms1 = MeanShift(bandwidth=ebw)
ms1.fit(X)

labels1 = ms1.labels_
labels1_u = np.unique(labels1)
nclusters = len(labels1_u)

meanshift_df = pd.DataFrame()
meanshift_df = df[['label', 'filename', 'positives'] + engines]
meanshift_df['cluster'] = labels1

print "Estimated Bandwidth: %s" % ebw
print "Number of clusters: %d" % nclusters


Estimated Bandwidth: 7.16015407847
Number of clusters: 29
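
The estimated bandwidth is the main knob here; a sketch of how estimate_bandwidth's quantile parameter (default 0.3) changes the granularity of the clustering, using the scaled matrix X from the cell above:

# Smaller quantiles give a smaller bandwidth and therefore more, tighter clusters
for q in [0.1, 0.2, 0.3, 0.5]:
    bw = estimate_bandwidth(X, quantile=q)
    labels = MeanShift(bandwidth=bw).fit(X).labels_
    print "quantile=%s  bandwidth=%.2f  clusters=%s" % (q, bw, len(np.unique(labels)))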

In [41]:
X = df.as_matrix(cols)
X = scale(X)
X = PCA(n_components=3).fit_transform(X)

figsize(12,8)
fig = plt.figure(figsize=plt.figaspect(.5))
ax = fig.add_subplot(1, 2, 1, projection='3d')
ax.scatter(X[:,0], X[:,1], X[:,2], c=meanshift_df['cluster'], s=50)
ax.set_title("MeanShift Clusters")
ax = fig.add_subplot(1, 2, 2, projection='3d')
ax.set_xlim(-10,2)
ax.set_ylim(15,30)
ax.set_zlim(-20,0)
ax.scatter(X[:,0], X[:,1], X[:,2], c=meanshift_df['cluster'], s=50)
ax.set_title("MeanShift Clusters (zoomed in)")
plt.show()



In [42]:
meanshift_df.cluster.value_counts().head(10)


Out[42]:
0    564
1     21
2      8
3      6
4      6
8      4
5      3
6      2
7      2
9      2
dtype: int64

In [43]:
clusters = set()
for name, blah in meanshift_df.groupby(['cluster', 'label'])['label']:
    if name[0] in clusters:
        print "%s Cluster has both Malicious and Non-Malicious Samples" % name[0]
    clusters.add(name[0])


0 Cluster has both Malicious and Non-Malicious Samples
1 Cluster has both Malicious and Non-Malicious Samples

In [44]:
sample_cluster = 2
meanshift_df[meanshift_df['cluster'] == sample_cluster][engines].head()


Out[44]:
Symantec Sophos F-Prot Kaspersky McAfee Malwarebytes
136 OSX.Crisis OSX/Morcut-E no detection Backdoor.OSX.Morcut.m RDN/Generic BackDoor!ea no detection
137 OSX.Crisis OSX/Morcut-E no detection Backdoor.OSX.Morcut.m RDN/Generic BackDoor!ea no detection
162 no result OSX/Morcut-D no detection Backdoor.OSX.Morcut.c no detection no detection
163 no result OSX/Morcut-D no detection Backdoor.OSX.Morcut.c no detection no detection
273 OSX.Crisis OSX/Morcut-A no detection Backdoor.OSX.Morcut.a OSX/Morcut no detection

5 rows × 6 columns


In [45]:
for eng in engines:
    print "%s - %s" %(eng, len(kmeans_df[kmeans_df['cluster'] == sample_cluster][eng].unique().tolist()))


Symantec - 1
Sophos - 1
F-Prot - 1
Kaspersky - 1
McAfee - 1
Malwarebytes - 1

In [46]:
X = df.as_matrix(cols)
X = scale(X)
X = PCA(n_components=n_comp).fit_transform(X)

ebw = estimate_bandwidth(X)
ms1 = MeanShift(bandwidth=ebw)
ms1.fit(X)

labels1 = ms1.labels_
labels1_u = np.unique(labels1)
nclusters = len(labels1_u)

meanshift_df = pd.DataFrame()
meanshift_df = df[['label', 'filename', 'positives'] + engines]
meanshift_df['cluster'] = labels1

print "Estimated Bandwidth: %s" % ebw
print "Number of clusters: %d" % nclusters
print
print "Cluster/Sample Layout"
print df.cluster.value_counts().head(10)
print

df['cluster'] = meanshift_df['cluster']
# Once again we can remove, in this case, the largest cluster for a less dense graph
tempdf = df[df['cluster'] != 0].reset_index(drop=True)
X = tempdf.as_matrix(cols)
X = scale(X)
X = PCA(n_components=3).fit_transform(X)

figsize(12,8)
fig = plt.figure(figsize=plt.figaspect(.5))
ax = fig.add_subplot(1, 2, 1, projection='3d')
ax.scatter(X[:,0], X[:,1], X[:,2], c=tempdf['cluster'], s=50)
ax.set_title("MeanShift Clusters")
ax = fig.add_subplot(1, 2, 2, projection='3d')
ax.set_xlim(-10,2)
ax.set_ylim(15,30)
ax.set_zlim(-20,0)
ax.scatter(X[:,0], X[:,1], X[:,2], c=tempdf['cluster'], s=50)
ax.set_title("MeanShift Clusters (zoomed in)")
plt.show()


Estimated Bandwidth: 6.11808339375
Number of clusters: 25

Cluster/Sample Layout
0     154
15    151
9      97
1      81
16     53
7      40
4      28
14     17
11      6
12      4
dtype: int64


In [47]:
clusters = set()
for name, blah in meanshift_df.groupby(['cluster', 'label'])['label']:
    if name[0] in clusters:
        print "%s Cluster has both Malicious and Non-Malicious Samples" % name[0]
    clusters.add(name[0])


0 Cluster has both Malicious and Non-Malicious Samples
1 Cluster has both Malicious and Non-Malicious Samples

In [48]:
sample_cluster = 2
meanshift_df[meanshift_df['cluster'] == sample_cluster][engines].head()


Out[48]:
Symantec Sophos F-Prot Kaspersky McAfee Malwarebytes
136 OSX.Crisis OSX/Morcut-E no detection Backdoor.OSX.Morcut.m RDN/Generic BackDoor!ea no detection
137 OSX.Crisis OSX/Morcut-E no detection Backdoor.OSX.Morcut.m RDN/Generic BackDoor!ea no detection
162 no result OSX/Morcut-D no detection Backdoor.OSX.Morcut.c no detection no detection
163 no result OSX/Morcut-D no detection Backdoor.OSX.Morcut.c no detection no detection
273 OSX.Crisis OSX/Morcut-A no detection Backdoor.OSX.Morcut.a OSX/Morcut no detection

5 rows × 6 columns


In [49]:
for eng in engines:
    print "%s - %s" %(eng, len(kmeans_df[kmeans_df['cluster'] == sample_cluster][eng].unique().tolist()))


Symantec - 1
Sophos - 1
F-Prot - 1
Kaspersky - 1
McAfee - 1
Malwarebytes - 1

It seems we've run into a similar situation with MeanShift as with DBSCAN. Instead of samples going unlabeled, we wound up with one cluster containing the majority of the samples. In the second set of graphs the large cluster was removed in order to better see the remaining samples.

Overall, it's important to see how using different algorithms can impact the end result, and to understand that impact when trying to transfer knowledge from one domain to another. This way it's possible to see how the various clustering techniques can lead to different Yara signatures, which will fire on different sets of files. When dealing with large amounts of malware, this is one way to group existing samples and to detect new potential variants of the same family.

Good luck and happy hunting!